An Improved Algorithm of Logical Structure Reconstruction for Re-flowable Document Understanding
Authors
Abstract
The basic idea of re-flowable document understanding and automatic typesetting is to generate a logical document by first identifying the logical paragraph tags in a re-flowable document and then determining the hierarchical relationships between physical units and logical tags. To overcome the shortcomings of conventional logical structure reconstruction methods, this paper proposes a new reconstruction method for re-flowable documents based on a directed graph. The method extracts the logical structure from a template document and then applies a single-source shortest-path algorithm on the directed graph to filter out redundant logical tags, thereby reconstructing the logical structure of the document. Experimental results show that the algorithm effectively improves the accuracy of logical structure recognition.
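The abstract does not say how the directed graph is built, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that the template document yields a set of permitted tag transitions (`allowed`, a hypothetical name), that every recognized tag is a node, and that an edge from tag i to a later tag j costs the number of tags skipped between them. A single-source shortest path (Dijkstra) from a virtual start node to a virtual end node then keeps the largest template-consistent subsequence and drops the redundant tags. The function name `filter_tags` and the sample tag names are also illustrative assumptions.

```python
import heapq


def filter_tags(tags, allowed):
    """Drop redundant tags so the remaining sequence follows the template.

    tags:    recognized logical tags in reading order (list of str)
    allowed: set of (previous_tag, next_tag) pairs taken from the template
             (hypothetical representation; the paper does not specify one)
    """
    n = len(tags)
    start, end = -1, n  # virtual source and sink nodes

    def edge_exists(i, j):
        # The virtual nodes connect to every tag; real tags need a
        # transition that the template permits.
        if i == start or j == end:
            return True
        return (tags[i], tags[j]) in allowed

    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, float("inf")):
            continue  # stale queue entry
        if i == end:
            break
        for j in range(i + 1, n + 1):
            if not edge_exists(i, j):
                continue
            nd = d + (j - i - 1)  # cost = number of tags skipped
            if nd < dist.get(j, float("inf")):
                dist[j] = nd
                prev[j] = i
                heapq.heappush(heap, (nd, j))

    # Walk back from the sink to recover the tags that were kept.
    kept = []
    node = prev[end]
    while node != start:
        kept.append(tags[node])
        node = prev[node]
    return kept[::-1]


# Hypothetical template transitions and a noisy recognition result.
allowed = {
    ("title", "author"), ("author", "abstract"), ("abstract", "heading"),
    ("heading", "paragraph"), ("paragraph", "paragraph"),
    ("paragraph", "heading"),
}
recognized = ["title", "author", "title", "abstract",
              "heading", "paragraph", "abstract", "paragraph"]
result = filter_tags(recognized, allowed)
print(result)
```

In this example the second `title` and the second `abstract` violate the assumed template, so the shortest path skips them. All edge weights are non-negative, which is what makes Dijkstra applicable here; a different cost model (for example, one weighted by recognition confidence) would change which tags count as redundant.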
Similar resources
A new text mining method for extracting user context information to improve the ranking of search engine results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
An Improved Algorithm for Network Reliability Evaluation
Binary Decision Diagram (BDD) is a data structure proved to be compact in representation and efficient in manipulation of Boolean formulas. Using Binary decision diagram in network reliability analysis has already been investigated by some researchers. In this paper we show how an exact algorithm for network reliability can be improved and implemented efficiently by using CUDD - Colorado Univer...
Stochastic attribute grammar model of document production and its use in document image decoding
Document Image Decoding (DID) refers to the process of document recognition within a communication theory framework. In this framework, a logical document structure is a message communicated by encoding the structure as an ideal image, transmitting the ideal image through a noisy channel, and decoding the degraded image into a logical structure as close to the original message as possible, on a...
An improved genetic algorithm for multidimensional optimization of precedence-constrained production planning and scheduling
Integration of production planning and scheduling is a class of problems commonly found in manufacturing industry. This class of problems associated with precedence constraint has been previously modeled and optimized by the authors, in which, it requires a multidimensional optimization at the same time: what to make, how many to make, where to make and the order to make. It is a combinatorial,...
Machine Learning for Reading Order Detection in Document Image Understanding
Document image understanding refers to logical and semantic analysis of document images in order to extract information understandable to humans and codify it into machine-readable form. Most of the studies on document image understanding have targeted the specific problem of associating layout components with logical labels, while less attention has been paid to the problem of extracting relat...
Journal title:
Volume, issue
Pages -
Publication date: 2015